Enable automatic URL linking #19110

ryzokuken · 2024-11-26T17:03:41Z

Detect URLs in the text content of a file and generate link annotations at the corresponding locations, so that plain-text links become clickable.

References:

Please note that this is a WIP PR for soliciting your feedback while I work on polishing things and hopefully optimizing further.

web/pdf_page_view.js

Snuffleupagus

How does this perform, especially in documents that contain a lot of text?

Also, we probably want a new option/preference to be able to disable this functionality.
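One way to read the request for an option is sketched below. It is only an illustration: the option name `enableAutoLinking`, its default, and the helper names are assumptions made for this sketch, not taken from the PR diff. The point is that a disabled feature does no scanning work at all, which also speaks to the performance concern for users who opt out.

```javascript
// Hypothetical option name and default; the review only asks that some
// option/preference exists, it does not say what it should be called.
const defaultOptions = Object.freeze({ enableAutoLinking: false });

// Hypothetical, deliberately simple stand-in for the link detection step.
function detectLinks(text) {
  return text.match(/https?:\/\/\S+/g) ?? [];
}

// Returns the detected links, or an empty list when the feature is disabled,
// so that no text is scanned for users who turn it off.
function maybeDetectLinks(text, options = defaultOptions) {
  if (!options.enableAutoLinking) {
    return [];
  }
  return detectLinks(text);
}
```

Calling `maybeDetectLinks(text, { enableAutoLinking: true })` runs the detection; calling it with the defaults skips it.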

Snuffleupagus · 2024-11-26T17:30:53Z

web/pdf_page_view.js

+  }
+
+  #processLinks() {
+    return this.pdfPage.getTextContent().then(content => {


We absolutely cannot fetch the textContent twice for each rendered page, since that'll be really inefficient in general.
Besides, it isn't necessary since the textContent is already available once the textLayer has rendered; see

pdf.js/web/text_layer_builder.js

Line 99 in 079eb24

this.highlighter?.setTextMapping(textDivs, textContentItemsStr);

pdf.js/web/text_highlighter.js

Lines 47 to 59 in 079eb24

  /**
   * Store two arrays that will map DOM nodes to text they should contain.
   * The arrays should be of equal length and the array element at each index
   * should correspond to the other. e.g.
   * `items[0] = "<span>Item 0</span>" and texts[0] = "Item 0";
   *
   * @param {Array<Node>} divs
   * @param {Array<string>} texts
   */
  setTextMapping(divs, texts) {
    this.textDivs = divs;
    this.textContentItemsStr = texts;
  }
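The suggestion above can be sketched roughly as follows. This is not the PR's implementation: `TextMappingStore` is a minimal stand-in for the highlighter, `findLinksInStoredText` is a hypothetical helper, and the pattern is a simplified one chosen for the sketch. It also assumes that joining the item strings without a separator is acceptable. The idea is that, once `setTextMapping` has been called, link detection can read the stored strings synchronously instead of calling `getTextContent()` again.

```javascript
// Minimal stand-in for the highlighter: it only stores the mapping.
class TextMappingStore {
  textDivs = [];
  textContentItemsStr = [];

  setTextMapping(divs, texts) {
    this.textDivs = divs;
    this.textContentItemsStr = texts;
  }
}

// Hypothetical helper: scan the strings that the text layer already
// extracted, rather than fetching the text content a second time.
function findLinksInStoredText(store) {
  // Assumption: items are joined without a separator, so a URL that is
  // split across two items is seen as one string.
  const text = store.textContentItemsStr.join("");
  // Simplified pattern for illustration, not the one used in the PR.
  return Array.from(text.matchAll(/https?:\/\/[^\s<>]+/g), m => ({
    url: m[0],
    index: m.index,
  }));
}
```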

Snuffleupagus · 2024-11-26T17:38:32Z

web/pdf_page_view.js

+  #processLinks() {
+    return this.pdfPage.getTextContent().then(content => {
+      const [text, diffs] = normalizedTextContent(content);
+      const urlRegex = /\b(?:https?:\/\/|mailto:|www.)(?:[[\S--\[]--\p{P}]|\/|[\p{P}--\[]+[[\S--\[]--\p{P}])+/gmv;


Also, the regular expression should probably be created just once (and then cached) to avoid re-creating it for every page.

ryzokuken · 2024-12-10

This might no longer be relevant, since the regular expression is now a static field on the class.
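A minimal sketch of the "create once, then reuse" approach with a static class field. The class name `AutolinkerSketch` and the pattern are assumptions made for this example; the PR's actual pattern is the more elaborate one shown in the diff above.

```javascript
class AutolinkerSketch {
  // Created once when the class is evaluated, then shared by every call.
  // Simplified, hypothetical pattern; not the one from the PR.
  static #urlRegex = /\b(?:https?:\/\/|mailto:)[^\s<>]+/g;

  static findUrls(text) {
    // matchAll() iterates over an internal copy of the regex, so the
    // shared instance's lastIndex is never mutated between pages.
    return Array.from(text.matchAll(AutolinkerSketch.#urlRegex), m => m[0]);
  }
}
```

One design point worth keeping in mind with a shared global (`g`) regex: methods such as `exec()` and `test()` update `lastIndex` on the shared object, which is why this sketch sticks to `matchAll()`.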

Snuffleupagus · 2024-11-26T17:40:29Z

web/pdf_page_view.js

+        annotationType: 2,
+        annotationFlags: 4,


These should use actual constants, rather than hard-coded numbers.

ryzokuken · 2024-12-09

I removed annotationFlags, so the only remaining hard-coded number is annotationType. Ideally I still shouldn't be assigning any of these manually and should use a constructor instead.
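For illustration, the two numbers from the diff can be given names as below. The values 2 and 4 correspond to `AnnotationType.LINK` and `AnnotationFlag.PRINT` in pdf.js; they are redefined locally here only so that the sketch runs on its own. The factory `createLinkAnnotationData` and the exact set of fields are hypothetical, meant to show the "use a constructor" idea rather than the PR's final shape.

```javascript
// Local stand-ins mirroring the values in the diff (2 and 4); in pdf.js
// these would be imported rather than redefined.
const AnnotationType = Object.freeze({ LINK: 2 });
const AnnotationFlag = Object.freeze({ PRINT: 0x04 });

// Hypothetical factory: callers pass only what varies, and the fixed
// fields are filled in from named constants in one place.
function createLinkAnnotationData(url, rect) {
  return {
    annotationType: AnnotationType.LINK,
    annotationFlags: AnnotationFlag.PRINT,
    subtype: "Link",
    url,
    rect,
  };
}
```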

ryzokuken · 2024-12-10T23:20:14Z

@Snuffleupagus

Besides, it isn't necessary since the textContent is already available once the textLayer has rendered; see

I wasn't sure I understood exactly what you were suggesting, but how does this commit look? It "fetches" the text content from the previous render of the textLayer and makes the processing step synchronous.

93e5417

ryzokuken · 2024-12-11T15:10:14Z

@Snuffleupagus never mind my last comment. After looking at pdfPageView._textHighlighter.textDivs a couple of times, it occurred to me what you were talking about.

https://github.com/mozilla/pdf.js/pull/19110/files#diff-71772f56be799df522c2076ab5fa476253ef15c607af5812536c41696d97cd59R73
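Once links are found in the joined text, each match has to be related back to the entries of `textDivs` in order to place an annotation. The sketch below shows one possible way to do that, assuming the item strings were joined without separators; both helper names are hypothetical and nothing here is taken from the PR diff.

```javascript
// Record the start offset of every item within the joined string.
function buildItemOffsets(texts) {
  const offsets = [];
  let total = 0;
  for (const text of texts) {
    offsets.push(total);
    total += text.length;
  }
  return offsets;
}

// Return the index of the item containing the given character offset,
// using a binary search for the last start offset <= charOffset.
function itemIndexAt(offsets, charOffset) {
  let lo = 0;
  let hi = offsets.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (offsets[mid] <= charOffset) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return lo;
}
```

With the first and last character offsets of a match, `itemIndexAt` gives the range of `textDivs` entries that the link spans, which is what a multi-line link would need.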

calixteman · 2024-12-24T14:50:24Z

web/annotation_layer_builder.js

+        if (
+          annotation.subtype === "Link" &&
+          annotation.url === link.url &&
+          Util.intersect(annotation.rect, link.rect) !== null


To avoid corner-case bugs, a better approach might be to compute the ratio area(intersection) / area(annotation.rect) and, if it is greater than some threshold, consider the two links very likely to be the same.
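The ratio check could look something like the sketch below. It assumes rectangles in the form [x1, y1, x2, y2] with x1 <= x2 and y1 <= y2, and the threshold of 0.5 is a placeholder, since the comment does not propose a value. The helper names are hypothetical and the intersection is computed locally rather than through `Util.intersect`.

```javascript
// Area of a rectangle given as [x1, y1, x2, y2]; empty or inverted
// rectangles count as zero.
function rectArea([x1, y1, x2, y2]) {
  return Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
}

// Area of the overlap between two rectangles (zero when they are disjoint).
function intersectionArea(a, b) {
  const x1 = Math.max(a[0], b[0]);
  const y1 = Math.max(a[1], b[1]);
  const x2 = Math.min(a[2], b[2]);
  const y2 = Math.min(a[3], b[3]);
  return rectArea([x1, y1, x2, y2]);
}

// Placeholder threshold; the review comment does not suggest a value.
const OVERLAP_THRESHOLD = 0.5;

function isLikelySameLink(annotation, link, threshold = OVERLAP_THRESHOLD) {
  if (annotation.subtype !== "Link" || annotation.url !== link.url) {
    return false;
  }
  const area = rectArea(annotation.rect);
  if (area === 0) {
    return false; // Degenerate rectangle: the ratio is undefined.
  }
  return intersectionArea(annotation.rect, link.rect) / area > threshold;
}
```

Compared with a plain "do they intersect at all" test, this avoids treating two links as duplicates when their rectangles merely touch at a corner.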

github-advanced-security bot found potential problems Nov 26, 2024

Snuffleupagus requested changes Nov 26, 2024

timvandermeij added the viewer label Nov 28, 2024

ryzokuken force-pushed the autolink-demo branch 2 times, most recently from a7d0024 to f3737b5, on December 10, 2024 23:11

github-advanced-security bot found potential problems Dec 10, 2024

ryzokuken force-pushed the autolink-demo branch 2 times, most recently from cad781f to 93e5417, on December 10, 2024 23:16

ryzokuken added 11 commits December 16, 2024 22:06

Enable automatic URL linking

99d90b0

Automatically detect links in the text content of a file and automatically generate link annotations at the appropriate locations to achieve automatic link detection and hyperlinking.

[SQUASHME] Don't inject link annotations from textLayer if annotations already exist

e543238

[SQUASHME] Fix generated link annotation data to match defaults

0a3d6f9

Add viewer option to enable autolinking

afe1c4e

[SQUASHME] Remove redundant annotation options

40d16f8

only create valid urls, fix lint errors and render multiline annotations

f2a988f

move linking logic to 'static' class

f44093f

make processLinks sync by using textContent from textLayer

ebeee90

make processLinks switch to imperative style to be cleaner and consistent

a20d5d9

simplify fetching textContents

50ad6b1

fix bug with textContents and check links for duplicates

7849bc0

ryzokuken force-pushed the autolink-demo branch from 517c74e to 7849bc0 on December 16, 2024 22:06

ryzokuken marked this pull request as ready for review December 16, 2024 22:10

ryzokuken requested a review from Snuffleupagus December 20, 2024 12:47

calixteman reviewed Dec 24, 2024
